Learning Interestingness Measures in Terminology Extraction. A ROC-based approach
Abstract
In the field of Text Mining, a key phase in data preparation is the extraction of terms, i.e. collocations of words attached to specific concepts (e.g. Philosophy-Dissertation). In this paper, Term Extraction is formalized as a supervised learning task: a ranking hypothesis is learned from a set of terms labeled as relevant or irrelevant by the expert. This task is tackled using the evolutionary algorithm ROGER, which optimizes the area under the ROC curve attached to a ranking hypothesis. Empirical validation on two real-world applications demonstrates outstanding improvements compared to state-of-the-art interestingness measures in Term Extraction. The approach is found to be robust across domains (Molecular Biology, Curriculum Vitæ) and languages (English, French).
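To make the optimization criterion concrete, the sketch below computes the area under the ROC curve (AUC) of a ranking hypothesis as the fraction of (relevant, irrelevant) term pairs that the hypothesis orders correctly, with ties counted as one half. This is the standard Wilcoxon-Mann-Whitney formulation of the AUC, not the ROGER implementation. The linear scoring function, the two feature values per term, and the toy labels are all assumptions made for illustration; the abstract does not specify the hypothesis space or the search procedure.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the fraction of (relevant, irrelevant) pairs ranked correctly.

    Ties between a relevant and an irrelevant term count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant term")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Hypothetical ranking hypothesis: a weighted sum of term criteria."""
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical toy data: (criteria values, labeled relevant by the expert?)
candidates = [
    ((0.9, 0.2), True),
    ((0.7, 0.4), True),
    ((0.3, 0.8), False),
    ((0.5, 0.5), False),
]
weights = (1.0, -1.0)  # assumed weights; a learner would search over these

scores = [linear_score(weights, f) for f, _ in candidates]
labels = [y for _, y in candidates]
toy_auc = auc(scores, labels)
print(f"AUC of the toy hypothesis: {toy_auc:.2f}")
```

An evolutionary learner in this spirit would treat `auc` as the fitness function and evolve the weight vector. The quadratic pair loop is kept for readability; a rank-based computation is the usual choice for larger term sets.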
Similar resources
Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction
Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocations of words, attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept “Machine Learning”). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm, exploiting a few collocations man...
Preference Learning in Terminology Extraction: A ROC-based approach
A key data preparation step in Text Mining, Term Extraction selects the terms, or collocations of words, attached to specific concepts. In this paper, the task of extracting relevant collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as relevant/irrelevant. The candidate terms are described along 13 standard statistical criteria meas...
Bagging Evolutionary ROC-based Hypotheses Application to Terminology Extraction
The claim of the paper is that Evolutionary Learning is a source of diverse hypotheses “for free”, and that this specificity can be used to combine, in an ensemble, the hypotheses learned in independent runs. Our algorithm, named Broger (Bagging-ROC GEnetic LEarneR), optimizes the Area Under the ROC Curve using Evolutionary Learning. This paper first presents the theoretical fram...
A Graph-based Clustering Approach to Evaluate Interestingness Measures: A Tool and a Comparative Study
Finding interestingness measures to evaluate association rules has become an important knowledge quality issue in KDD. Many interestingness measures may be found in the literature, and many authors have discussed and compared interestingness properties in order to improve the choice of the most suitable measures for a given application. As interestingness depends both on the data structure and ...
Categorization of interestingness measures for knowledge extraction
Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules, most of which are redundant and irrelevant. It is therefore necessary to use f...